Guide to using tidyr

Now that we've learned about dplyr we can begin to learn about tidyr which is a complementary package that will help us create tidy data sets! So what do we mean when we say "tidy data"?

Tidy data is when we have a data set where every row is an observation and every column is a variable, this way the data is organized in such a way where every cell is a value for a specific variable of a specific observation. Having your data in this format will help build an understanding of your data and allow you to analyze or visualize it quickly and efficiently.

After viewing this lecture, you can reference this handy cheatsheet on Data Wrangling

Installing tidyr

In [ ]:
install.packages('tidyr',repos = 'http://cran.us.r-project.org')
In [5]:
library(tidyr)
library(data.table)

Data.frames versus data.tables

All data.tables are also data.frames. Loosely speaking, you can think of data.tables as data.frames with extra features.

data.frame is part of base R.

data.table is a package that extends data.frames. Two of its most notable features are speed and cleaner syntax.

However, that syntax for a data.table is different from the standard R syntax for data.frame while being hard for the untrained eye to distinguish at a glance. Therefore, if you read a code snippet and there is no other context to indicate you are working with data.tables and try to apply the code to a data.frame it may fail or produce unexpected results.

So what are some of the practical differences? Here are a few:

  • much faster and very intuitive by operations
  • You won't accidentally print out a huge data.frame with the need to press Ctrl-C, data.table prevents this sort of accident
  • faster and better file reading with fread
  • the package also provides a number of other utility functions, like %between% or rbindlist that make life better
  • pretty much faster for a lot of basic operations, since a lot of data.frame operations copy the entire thing needlessly

Using tidyr

We'll cover some of the most useful functions in tidyr. Including the following:

  • gather()
  • spread()
  • separate()
  • unite()

Which basically perform the following actions:

Example Data Set

Let's create some fake data that needs to be cleaned using tidyr

In [41]:
comp <- c(1,1,1,2,2,2,3,3,3)
yr <- c(1998,1999,2000,1998,1999,2000,1998,1999,2000)
q1 <- runif(9, min=0, max=100)
q2 <- runif(9, min=0, max=100)
q3 <- runif(9, min=0, max=100)
q4 <- runif(9, min=0, max=100)

df <- data.frame(comp=comp,year=yr,Qtr1 = q1,Qtr2 = q2,Qtr3 = q3,Qtr4 = q4)
In [55]:
df
Out[55]:
compyearQtr1Qtr2Qtr3Qtr4
11199855.5911872.4986799.3018910.59585
21199994.9690475.0737768.448993.10778
31200096.6044764.8713130.9995111.94151
42199843.6615992.2837135.4059242.76989
52199932.0109370.8862353.4457581.37271
62200010.318847.8156279.6316819.96017
73199859.4295121.9303974.7526549.7312
83199921.845220.717847.14217267.33088
93200062.0337598.8079780.5436632.46341

Gather() and Spread()

Sometimes people like to think of these operations as analogous to pivot tables in excel, let's see some examples of how to use them:

gather()

The gather() function will collapse multiple columns into key-pair values. The data frame above is considered wide since the time variable (represented as quarters) is structured such that each quarter represents a variable. To re-structure the time component as an individual variable, we can gather each quarter within one column variable and also gather the values associated with each quarter in a second column variable.

In [48]:
# Using Pipe Operator
head(df %>% gather(Quarter,Revenue,Qtr1:Qtr4))
Out[48]:
compyearQuarterRevenue
111998Qtr155.59118
211999Qtr194.96904
312000Qtr196.60447
421998Qtr143.66159
521999Qtr132.01093
622000Qtr110.3188
In [47]:
# With just the function
head(gather(df,Quarter,Revenue,Qtr1:Qtr4))
Out[47]:
compyearQuarterRevenue
111998Qtr155.59118
211999Qtr194.96904
312000Qtr196.60447
421998Qtr143.66159
521999Qtr132.01093
622000Qtr110.3188

spread()

This is the complement of gather(), which is why its called spread():

In [51]:
stocks <- data.frame(
  time = as.Date('2009-01-01') + 0:9,
  X = rnorm(10, 0, 1),
  Y = rnorm(10, 0, 2),
  Z = rnorm(10, 0, 4)
)
stocks
Out[51]:
timeXYZ
12009-01-011.781885-0.2303815-3.270052
22009-01-02-1.7489933.0162160.3554105
32009-01-030.091888330.1431827-3.556647
42009-01-041.010212-0.11196860.163945
52009-01-05-0.3455250.32999471.348945
62009-01-06-0.41523160.84371624.36796
72009-01-070.1473563-0.2800663-0.4905551
82009-01-08-0.71511261.1122285.4409
92009-01-09-1.149541-0.2481949-1.730787
102009-01-100.81909351.287082-6.21617
In [52]:
stocksm <- stocks %>% gather(stock, price, -time)
In [53]:
stocksm %>% spread(stock, price)
Out[53]:
timeXYZ
12009-01-011.781885-0.2303815-3.270052
22009-01-02-1.7489933.0162160.3554105
32009-01-030.091888330.1431827-3.556647
42009-01-041.010212-0.11196860.163945
52009-01-05-0.3455250.32999471.348945
62009-01-06-0.41523160.84371624.36796
72009-01-070.1473563-0.2800663-0.4905551
82009-01-08-0.71511261.1122285.4409
92009-01-09-1.149541-0.2481949-1.730787
102009-01-100.81909351.287082-6.21617
In [56]:
stocksm %>% spread(time, price)
Out[56]:
stock2009-01-012009-01-022009-01-032009-01-042009-01-052009-01-062009-01-072009-01-082009-01-092009-01-10
1X1.781885-1.7489930.091888331.010212-0.345525-0.41523160.1473563-0.7151126-1.1495410.8190935
2Y-0.23038153.0162160.1431827-0.11196860.32999470.8437162-0.28006631.112228-0.24819491.287082
3Z-3.2700520.3554105-3.5566470.1639451.3489454.36796-0.49055515.4409-1.730787-6.21617

Separate and Unite

separate()

Given either regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

In [62]:
df <- data.frame(x = c(NA, "a.x", "b.y", "c.z"))
df
Out[62]:
x
1NA
2a.x
3b.y
4c.z
In [64]:
df %>% separate(x, c("ABC", "XYZ"))
Out[64]:
ABCXYZ
1NANA
2ax
3by
4cz

unite()

Unite is a convenience function to paste together multiple columns into one.

In [66]:
head(mtcars)
Out[66]:
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
In [69]:
unite_(mtcars, "vs.am", c("vs","am"),sep = '.')
Out[69]:
mpgcyldisphpdratwtqsecvs.amgearcarb
Mazda RX42161601103.92.6216.460.144
Mazda RX4 Wag2161601103.92.87517.020.144
Datsun 71022.84108933.852.3218.611.141
Hornet 4 Drive21.462581103.083.21519.441.031
Hornet Sportabout18.783601753.153.4417.020.032
Valiant18.162251052.763.4620.221.031
Duster 36014.383602453.213.5715.840.034
Merc 240D24.44146.7623.693.19201.042
Merc 23022.84140.8953.923.1522.91.042
Merc 28019.26167.61233.923.4418.31.044
Merc 280C17.86167.61233.923.4418.91.044
Merc 450SE16.48275.81803.074.0717.40.033
Merc 450SL17.38275.81803.073.7317.60.033
Merc 450SLC15.28275.81803.073.78180.033
Cadillac Fleetwood10.484722052.935.2517.980.034
Lincoln Continental10.4846021535.42417.820.034
Chrysler Imperial14.784402303.235.34517.420.034
Fiat 12832.4478.7664.082.219.471.141
Honda Civic30.4475.7524.931.61518.521.142
Toyota Corolla33.9471.1654.221.83519.91.141
Toyota Corona21.54120.1973.72.46520.011.031
Dodge Challenger15.583181502.763.5216.870.032
AMC Javelin15.283041503.153.43517.30.032
Camaro Z2813.383502453.733.8415.410.034
Pontiac Firebird19.284001753.083.84517.050.032
Fiat X1-927.3479664.081.93518.91.141
Porsche 914-2264120.3914.432.1416.70.152
Lotus Europa30.4495.11133.771.51316.91.152
Ford Pantera L15.883512644.223.1714.50.154
Ferrari Dino19.761451753.622.7715.50.156
Maserati Bora1583013353.543.5714.60.158
Volvo 142E21.441211094.112.7818.61.142
In [70]:
# Separate is the complement of unite
mtcars %>%
  unite(vs_am, vs, am) %>%
  separate(vs_am, c("vs", "am"))
Out[70]:
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
Duster 36014.383602453.213.5715.840034
Merc 240D24.44146.7623.693.19201042
Merc 23022.84140.8953.923.1522.91042
Merc 28019.26167.61233.923.4418.31044
Merc 280C17.86167.61233.923.4418.91044
Merc 450SE16.48275.81803.074.0717.40033
Merc 450SL17.38275.81803.073.7317.60033
Merc 450SLC15.28275.81803.073.78180033
Cadillac Fleetwood10.484722052.935.2517.980034
Lincoln Continental10.4846021535.42417.820034
Chrysler Imperial14.784402303.235.34517.420034
Fiat 12832.4478.7664.082.219.471141
Honda Civic30.4475.7524.931.61518.521142
Toyota Corolla33.9471.1654.221.83519.91141
Toyota Corona21.54120.1973.72.46520.011031
Dodge Challenger15.583181502.763.5216.870032
AMC Javelin15.283041503.153.43517.30032
Camaro Z2813.383502453.733.8415.410034
Pontiac Firebird19.284001753.083.84517.050032
Fiat X1-927.3479664.081.93518.91141
Porsche 914-2264120.3914.432.1416.70152
Lotus Europa30.4495.11133.771.51316.91152
Ford Pantera L15.883512644.223.1714.50154
Ferrari Dino19.761451753.622.7715.50156
Maserati Bora1583013353.543.5714.60158
Volvo 142E21.441211094.112.7818.61142

Hopefully you'll find tidyr useful when having to clean up your data!